Horizon-Independent Optimal Prediction with Log-Loss in Exponential Families
We study online learning under logarithmic loss with regular parametric
models. Hedayati and Bartlett (2012b) showed that a Bayesian prediction
strategy with Jeffreys prior and sequential normalized maximum likelihood
(SNML) coincide and are optimal if and only if the latter is exchangeable, and
if and only if the optimal strategy can be calculated without knowing the time
horizon in advance. They left open the question of which families admit
exchangeable SNML strategies. This paper fully resolves that open problem for
one-dimensional exponential families: exchangeability can occur for only
three classes of natural exponential family distributions, namely the Gaussian,
the Gamma, and the Tweedie exponential family of order 3/2. Keywords: SNML
Exchangeability, Exponential Family, Online Learning, Logarithmic Loss,
Bayesian Strategy, Jeffreys Prior, Fisher Information. Comment: 23 pages
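The Gaussian case of the SNML/Jeffreys coincidence can be checked numerically. Below is a minimal sketch (not from the paper) for a Gaussian with unit variance and unknown mean: the SNML predictive, obtained by appending each candidate outcome, plugging in the maximum-likelihood mean, and normalizing, agrees with the Bayes predictive under the Jeffreys (here flat) prior, which is N(sample mean, 1 + 1/t).

```python
import numpy as np

def snml_density(grid, past):
    """SNML predictive for a unit-variance Gaussian with unknown mean:
    append each candidate outcome y, plug in the ML mean (the sample
    average of the extended data), then normalize numerically."""
    dens = np.empty_like(grid)
    for i, y in enumerate(grid):
        x = np.append(past, y)
        dens[i] = np.exp(-0.5 * np.sum((x - x.mean()) ** 2))
    dx = grid[1] - grid[0]
    return dens / (dens.sum() * dx)

past = np.array([0.3, -1.2, 0.7, 0.1])   # observed x_1 .. x_t
t = len(past)
grid = np.linspace(-8.0, 8.0, 4001)

snml = snml_density(grid, past)

# Jeffreys prior for the Gaussian mean is flat; the resulting Bayes
# predictive after t observations is N(sample mean, 1 + 1/t).
mu, s2 = past.mean(), 1.0 + 1.0 / t
bayes = np.exp(-0.5 * (grid - mu) ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)

print(np.max(np.abs(snml - bayes)))  # tiny: the two strategies coincide
```

The data values and grid are arbitrary; any sample exhibits the same coincidence.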
Learning to Crawl
Web crawling is the problem of keeping a cache of webpages fresh, i.e.,
having the most recent copy available when a page is requested. This problem is
usually coupled with the natural restriction that the bandwidth available to
the web crawler is limited. The corresponding optimization problem was solved
optimally by Azar et al. [2018] under the assumption that, for each webpage,
both the elapsed time between two changes and the elapsed time between two
requests follow a Poisson distribution with known parameters. In this paper, we
study the same control problem but under the assumption that the change rates
are unknown a priori, and thus we need to estimate them in an online fashion
using only partial observations (i.e., single-bit signals indicating whether
the page has changed since the last refresh). As a point of departure, we
characterise the conditions under which one can solve the problem with such
partial observability. Next, we propose a practical estimator and compute
confidence intervals for it in terms of the elapsed time between the
observations. Finally, we show that the explore-and-commit algorithm achieves
sublinear regret with a carefully chosen exploration horizon.
Our simulation study shows that our online policy scales well and achieves
close to optimal performance for a wide range of the parameters.Comment: Published at AAAI 202
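The paper's estimator is stated in terms of the elapsed times between observations; as a simplified, hypothetical sketch of the underlying idea (not the paper's exact construction), suppose the crawler probes a page at a fixed interval tau and the page changes as a Poisson process with rate lam. Each probe then yields a Bernoulli(1 - exp(-lam * tau)) bit, and the rate can be recovered by inverting the empirical change frequency:

```python
import numpy as np

rng = np.random.default_rng(0)

lam_true, tau, n_probes = 2.0, 0.4, 20000

# Each probe returns a single bit: did the page change during the last
# interval of length tau?  Under a Poisson(lam_true) change process this
# bit is Bernoulli(1 - exp(-lam_true * tau)).
bits = rng.random(n_probes) < 1.0 - np.exp(-lam_true * tau)

p_hat = bits.mean()
lam_hat = -np.log(1.0 - p_hat) / tau   # invert the change probability

print(lam_hat)  # close to lam_true
```

With unevenly spaced probes the same idea leads to a likelihood over the individual intervals, which is where the paper's confidence intervals come in.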
Open problem: Fast and optimal online portfolio selection
Online portfolio selection has received much attention in the COLT community since its introduction by Cover, but all state-of-the-art methods fall short in at least one of the following ways: they are either (i) computationally infeasible; (ii) not guaranteed to achieve the optimal regret; or (iii) reliant on the assumption that the gradients are bounded, which is unnecessary and cannot be guaranteed. We are interested in a natural follow-the-regularized-leader (FTRL) approach based on the log barrier regularizer, which is computationally feasible. The open problem we put before the community is to formally prove whether this approach achieves the optimal regret. Resolving this question will likely lead to new techniques to analyse FTRL algorithms. There are also interesting technical connections to self-concordance, which has previously been used in the context of bandit convex optimization.
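The FTRL objective in question can be written down concretely. The sketch below (an illustration of the objective, not a claimed-optimal implementation) handles two assets, so the portfolio is w = (p, 1-p) and each round minimizes the sum of past negative log-wealth terms plus the log-barrier regularizer over a grid:

```python
import numpy as np

def ftrl_log_barrier(returns, reg=1.0):
    """Two-asset online portfolio selection sketch.  Each round picks
    w = (p, 1-p) minimizing the FTRL objective
        sum_{s<t} -log(w . x_s)  +  reg * (-log p - log(1 - p)),
    i.e. follow-the-regularized-leader with a log-barrier regularizer,
    by brute-force grid search over p."""
    grid = np.linspace(1e-4, 1 - 1e-4, 9999)
    wealth, weights = 1.0, []
    for t, x in enumerate(returns):
        obj = reg * (-np.log(grid) - np.log(1 - grid))   # log barrier
        for r in returns[:t]:                            # past log losses
            obj = obj - np.log(grid * r[0] + (1 - grid) * r[1])
        p = grid[np.argmin(obj)]
        weights.append(p)
        wealth *= p * x[0] + (1 - p) * x[1]
    return wealth, weights

# Hypothetical price relatives per round: (asset A, asset B).
returns = [(1.1, 0.9), (0.8, 1.2), (1.05, 1.0), (0.7, 1.3)]
wealth, weights = ftrl_log_barrier(returns)
print(wealth, weights[0])  # round 1: the barrier alone gives p = 1/2
```

The open problem concerns the regret of exactly this kind of update when the minimization is done properly in higher dimension; the grid search here is only to keep the sketch self-contained.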
Quantum learning: asymptotically optimal classification of qubit states
Pattern recognition is a central topic in learning theory, with numerous applications such as voice and text recognition, image analysis and computer diagnosis. The statistical setup in classification is the following: we are given an i.i.d. training set (X_1, Y_1), ..., (X_n, Y_n), where X_i represents a feature and Y_i ∈ {0, 1} is a label attached to that feature. The underlying joint distribution of (X, Y) is unknown, but we can learn about it from the training set, and we aim at devising low-error classifiers f: X → Y used to predict the label of new incoming features. In this paper, we solve a quantum analogue of this problem, namely the classification of two arbitrary unknown mixed qubit states. Given a number of 'training' copies from each of the states, we would like to 'learn' about them by performing a measurement on the training set. The outcome is then used to design measurements for the classification of future systems with unknown labels. We find the asymptotically optimal classification strategy and show that it typically performs strictly better than a plug-in strategy, which consists of estimating the states separately and then discriminating between them using the Helstrom measurement. The figure of merit is the excess risk, i.e. the difference between the probability of error and the probability of error of the optimal measurement for known states. We show that the excess risk scales as n^{-1} and compute the exact constant of the rate.
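The benchmark in the excess-risk definition, the error of the optimal measurement for known states, is the Helstrom bound, which is easy to compute for qubits. A small sketch (equal priors in the examples, states chosen for illustration):

```python
import numpy as np

def helstrom_error(rho0, rho1, p0=0.5):
    """Minimal error probability for discriminating two known density
    matrices with priors (p0, 1 - p0):
        P_err = (1 - || p0*rho0 - (1-p0)*rho1 ||_1) / 2,
    where ||.||_1 is the trace norm, i.e. the sum of absolute
    eigenvalues of the Hermitian difference operator."""
    gamma = p0 * rho0 - (1 - p0) * rho1
    trace_norm = np.abs(np.linalg.eigvalsh(gamma)).sum()
    return 0.5 * (1.0 - trace_norm)

ket0 = np.array([[1.0, 0.0], [0.0, 0.0]])   # |0><0|
ket1 = np.array([[0.0, 0.0], [0.0, 1.0]])   # |1><1|
maximally_mixed = np.eye(2) / 2.0

print(helstrom_error(ket0, ket1))             # orthogonal states: 0.0
print(helstrom_error(ket0, ket0))             # identical states: 0.5
print(helstrom_error(ket0, maximally_mixed))  # partial overlap: 0.25
```

The plug-in strategy mentioned in the abstract applies this measurement to estimated states; the paper's point is that doing so is typically strictly suboptimal.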
Following the Flattened Leader
We analyze the regret, measured in terms of log loss, of the maximum likelihood (ML) sequential prediction strategy. This “follow the leader” strategy also defines one of the main versions of Minimum Description Length model selection. We proved in prior work for single-parameter exponential family models that (a) in the misspecified case, the redundancy of follow-the-leader is not (1/2) log n + O(1), as it is for other universal prediction strategies; as such, the strategy also yields suboptimal individual-sequence regret and inferior model selection performance; and (b) in general it is not possible to achieve the optimal redundancy when predictions are constrained to the distributions in the considered model. Here we describe a simple “flattening” of the sequential ML and related predictors that does achieve the optimal worst-case individual-sequence regret of (k/2) log n + O(1) for k-parameter exponential family models on bounded outcome spaces; for unbounded spaces, we provide almost-sure results. Simulations show a major improvement of the resulting model selection criterion.
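The failure mode of raw follow-the-leader under log loss is easy to exhibit even for a Bernoulli model: the ML plug-in can assign probability zero to the next outcome, incurring infinite loss. The sketch below uses add-1/2 (Krichevsky-Trofimov) smoothing as a stand-in illustration of why smoothing or flattening helps; it is not the paper's flattening construction:

```python
import math

def cumulative_log_loss(seq, predict):
    """Sum of -log p_t(x_t), where predict(past) returns P(next bit = 1)."""
    loss = 0.0
    for t, x in enumerate(seq):
        p1 = predict(seq[:t])
        p = p1 if x == 1 else 1.0 - p1
        loss += math.inf if p == 0.0 else -math.log(p)
    return loss

# Follow the leader: predict with the ML (empirical) frequency.
ml = lambda past: (sum(past) / len(past)) if past else 0.5
# Smoothed predictor (add-1/2): never assigns probability zero.
kt = lambda past: (sum(past) + 0.5) / (len(past) + 1.0)

seq = [0, 0, 0, 1, 0, 1, 1, 0]
print(cumulative_log_loss(seq, ml))  # inf: ML says P(1)=0 before the first 1
print(cumulative_log_loss(seq, kt))  # finite
```

The paper's results concern the finer question of the O(1) terms in the regret, but the zero-probability pathology is the simplest symptom of the same problem.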
Maximum Likelihood vs. Sequential Normalized Maximum Likelihood in On-line Density Estimation
The paper considers sequential prediction of individual sequences with log loss (online density estimation) using an exponential family of distributions. We first analyze the regret of the maximum
likelihood ("follow the leader") strategy. We find that this strategy (1) is suboptimal and (2) requires an additional assumption about boundedness of the data sequence. We then show that both problems can be addressed by adding the currently predicted outcome to the calculation of the maximum likelihood, followed by normalization of the distribution. The strategy obtained in this way is known in the literature as the sequential normalized maximum likelihood or last-step minimax strategy. We show for the first time that for general exponential families, the regret is bounded by the familiar (k/2) log(n) and thus optimal up to O(1). We also show the relationship to the Bayes strategy with Jeffreys' prior.
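The SNML/last-step-minimax recipe (append each candidate outcome, maximize the likelihood, normalize) can be made concrete for a Bernoulli model. A hedged illustration, not taken from the paper:

```python
def snml_prob_one(past):
    """SNML / last-step minimax prediction for a Bernoulli model: for
    each candidate next bit b, compute the maximized likelihood of
    past + [b] (the MLE is the empirical frequency), then normalize
    over b in {0, 1}."""
    def max_lik(bits):
        n, k = len(bits), sum(bits)
        th = k / n
        return (th ** k) * ((1 - th) ** (n - k))   # 0 ** 0 == 1 in Python
    l1 = max_lik(past + [1])
    l0 = max_lik(past + [0])
    return l1 / (l0 + l1)

print(snml_prob_one([]))         # 0.5 by symmetry
print(snml_prob_one([1, 1, 0]))  # leans toward 1, but stays below 1
```

Unlike the raw ML plug-in, this never assigns probability zero to either outcome, which is exactly the boundedness issue the abstract refers to.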
Bipartite Ranking through Minimization of Univariate Loss
Minimization of the rank loss or, equivalently, maximization of the AUC in bipartite ranking calls for minimizing the number of disagreements between pairs of instances. Since the complexity of this problem is inherently quadratic in the number of training examples, it is tempting to ask how much is actually lost by minimizing, as a surrogate, a simple univariate loss function, as done by standard classification methods. In this paper, we first note that minimization of 0/1 loss is not an option, as it may yield an arbitrarily high rank loss. We show, however, that better results can be achieved by means of a weighted (cost-sensitive) version of 0/1 loss. Yet, the real gain is obtained through margin-based loss functions, for which we are able to derive proper bounds, not only for rank risk but, more importantly, also for rank regret. The paper is completed with an experimental study in which we address specific questions raised by our theoretical analysis.
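The quadratic pairwise quantity being bounded here is easy to state in code. A tiny sketch (made-up scores and labels) of the rank loss, i.e. 1 - AUC, computed by direct pair counting:

```python
def rank_loss(scores, labels):
    """Fraction of discordant positive/negative pairs, with ties
    counted as 1/2; equals 1 - AUC.  The double loop is quadratic in
    the number of examples, which is what motivates replacing it with
    a univariate surrogate loss."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    bad = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return bad / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.8, 0.2, 0.4]
print(rank_loss(scores, labels))  # 1.5 bad pairs out of 9
```

A univariate surrogate instead sums a per-example loss over the n examples; the paper's bounds relate the minimizers of the two.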
Learning Eigenvectors for Free
We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu^T, where u is a unit-length vector in R^n. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we aim to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n^2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Quite surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.
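A matrix analogue of the multiplicative-weights idea can be sketched as follows (a hypothetical illustration in the spirit of the abstract, not the paper's specific algorithms): maintain a density matrix proportional to the matrix exponential of the accumulated dyads. When all observed dyads share an eigensystem, the update stays diagonal in that basis and reduces to the classical vector case, echoing the eigenvector observation:

```python
import numpy as np

def matrix_mw(dyad_vectors, eta=0.5, n=3):
    """Matrix multiplicative-weights style sketch: after observing unit
    vectors u_1..u_t, predict with the density matrix
        W_t = exp(eta * sum_s u_s u_s^T) / trace(exp(...)),
    computing the matrix exponential via an eigendecomposition of the
    symmetric exponent."""
    S = np.zeros((n, n))
    for u in dyad_vectors:
        S += np.outer(u, u)
    evals, evecs = np.linalg.eigh(eta * S)
    W = (evecs * np.exp(evals)) @ evecs.T   # V exp(Lambda) V^T
    return W / np.trace(W)

e = np.eye(3)
# All dyads share the standard-basis eigensystem, so W stays diagonal
# and the update coincides with the classical multinomial one:
W = matrix_mw([e[0], e[0], e[1]])
print(np.round(W, 3))
```

The learning rate eta and dimension n here are arbitrary illustration parameters.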